Intel CPU Architecture (Intel Core Architecture)
Posted on Sunday, 2 Tir 1392 (23 June 2013)
Views: 851
Author: TAKPAR
Majid Rahimpour (Ghaem PNU / Computer Architecture)
The Intel Core i7 Chip Architecture
The hot chip in today’s market is the Intel Sandy Bridge 64-bit chip. According to Intel’s documentation, this chip has the following features:
- 64-bit processor
- Quad cores
- Each core is dual threaded
- 4-issue, out-of-order execution
- 3- and 4-operand AVX instruction extensions
- 32 nm manufacturing process
The chip is composed of a number of basic building blocks:
- 3.4 GHz (max) Sandy Bridge core (4 units)
- 32 KB L1D cache (per core)
- 256 KB L2 cache (per core)
- 8 MB L3 cache (per chip)
- DDR3 memory controllers
- Direct Media Interface
- PCI-E 2.0 controller
- 0.85 GHz Gen 6 GPU
There are several interesting features of this architecture that are worth discussing before we leave the Pentium world.
The basic processing sequence
In a simple processor, we talked about processing using an analogy, the Texas Four-step:
- Fetch
  - Place the RIP address on the memory bus
  - Trigger a memory read operation
  - Wait for the memory unit to respond
  - Read up to 8 bytes from the data bus into an internal register
  - Route the first few bytes to a decoder
- Decode
  - Identify registers to be used by instruction
  - Calculate next value for RIP
  - Place address for additional data needed on memory bus
  - Trigger a read operation
  - Wait for the memory unit to respond
  - Read up to 8 bytes from memory into internal register
- Execute
  - Route data to ALU for processing
- Store
  - Place address on address bus for memory write
  - Place data on data bus
  - Trigger write to memory
  - Wait for memory unit to finish
  - Route data to final internal registers
That is just a notion of the individual steps needed to complete the processing. If we need to wait until all are finished before we start another instruction, things can be pretty slow.
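To make the four-step concrete, here is a minimal sketch of the cycle in C, assuming a toy machine with an invented instruction encoding (one instruction per 8-byte word; the opcodes and field layout are made up for illustration, not x86):

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3 };

    static uint64_t memory[256];   /* main memory, one word per slot */
    static uint64_t regs[16];      /* general-purpose registers      */
    static uint64_t rip;           /* instruction pointer            */

    int main(void) {
        /* hypothetical program: memory[102] = memory[100] + memory[101] */
        memory[100] = 40;
        memory[101] = 2;
        memory[0] = OP_LOAD  << 24 | 1 << 16 | 100;
        memory[1] = OP_ADD   << 24 | 1 << 16 | 101;
        memory[2] = OP_STORE << 24 | 1 << 16 | 102;
        memory[3] = OP_HALT  << 24;

        for (;;) {
            uint64_t insn = memory[rip];     /* Fetch                     */
            int op   = insn >> 24 & 0xFF;    /* Decode: pull fields apart */
            int reg  = insn >> 16 & 0xFF;
            int addr = insn & 0xFFFF;
            rip = rip + 1;                   /* next value for RIP        */
            switch (op) {                    /* Execute and Store         */
            case OP_LOAD:  regs[reg] = memory[addr];  break;
            case OP_ADD:   regs[reg] += memory[addr]; break;
            case OP_STORE: memory[addr] = regs[reg];  break;
            case OP_HALT:
                printf("memory[102] = %llu\n",
                       (unsigned long long)memory[102]);  /* prints 42 */
                return 0;
            }
        }
    }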
Designers of the actual hardware wanted this process to be as fast as possible. They also wanted it to generate as little heat as possible, and consume as little energy as possible. Clearly this is a tall order for hardware designers. But they came up with several tricks to do all of that.
How do we speed things up? Let’s look at the major stages and see what they came up with!
Instruction fetching
Let’s start by looking at the actual process of fetching bits from memory.
Instructions are stored in the main memory of the system. Usually there is a lot of space in that memory, and it is pretty quick, but not compared with the speed of the processor itself. If we could decrease the access time to get at the instructions, we would speed up processing. The modern Pentium processor adds several layers of very high speed memory components to do just that. The idea is based on the simple notion of a high-speed cache!
Cache memory
Suppose that accessing memory is really slow (actually, it is). In general, when you access a single item in memory, the odds are that you will access another item nearby. Rather than get one item at a time, let’s suppose that we grab a block of items. Obviously, it will take a bit of time to get this block, but we can take steps to speed that up, especially if they are in sequential order. (The Pentium architecture supports special devices to move blocks of memory around at high speeds.) We store this block in special high-speed memory and add a layer of logic between the processor and memory.
The new logic is pretty simple. Any time the processor requests an item from memory, a check is first made to see if the item is already in the cache. If so, it can be returned really quickly. If not, we still need to go to physical (slow) memory. The potential speed up is significant!
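Here is a minimal sketch of that check in C for a direct-mapped cache, with invented sizes (the real L1D is set-associative, so take this as the idea rather than the actual design):

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_LINES  512   /* 512 lines x 64 B = 32 KB, like the L1D above */
    #define LINE_BYTES 64

    struct cache_line {
        bool     valid;
        uint64_t tag;                  /* identifies which memory line this is */
        uint8_t  data[LINE_BYTES];
    };

    static struct cache_line cache[NUM_LINES];

    /* Returns true on a hit; on a miss the caller must fetch the
       whole line from (slow) main memory and install it here. */
    bool cache_lookup(uint64_t addr, uint8_t *out) {
        uint64_t line_no = addr / LINE_BYTES;    /* strip the byte offset */
        uint64_t index   = line_no % NUM_LINES;  /* which slot to check   */
        uint64_t tag     = line_no / NUM_LINES;
        struct cache_line *l = &cache[index];
        if (l->valid && l->tag == tag) {         /* hit: answer quickly   */
            *out = l->data[addr % LINE_BYTES];
            return true;
        }
        return false;                            /* miss: go to memory    */
    }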
What about writing?
There is one problem with this. If we want to write data, we certainly need to write it into the cache. But what about writing to main memory? One solution is to write through the cache into main memory so both are always synchronized. Alternatively, we might defer the write until a suitable time, but this runs the risk that we may get out of sync, especially if another processor (or core) tries to access the same information. Modern processors try to avoid writing through, preferring to wait as long as possible before paying this penalty. Once again, because we have sequential blocks of memory to write, we can use the special hardware to speed this up.
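A minimal sketch of the two policies in C, assuming an invented cache line record with a dirty flag and a stand-in slow_memory array:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    static uint8_t slow_memory[1 << 20];   /* stand-in for main memory */
    struct line { bool valid, dirty; uint64_t tag; uint8_t data[64]; };

    /* Write-through: update the cache AND main memory immediately. */
    void write_through(struct line *l, uint64_t addr, uint8_t value) {
        l->data[addr % 64] = value;
        slow_memory[addr]  = value;    /* pay the memory latency now */
    }

    /* Write-back: update only the cache and mark the line dirty;
       memory is brought up to date later. */
    void write_back(struct line *l, uint64_t addr, uint8_t value) {
        l->data[addr % 64] = value;
        l->dirty = true;               /* remember we owe memory a write */
    }

    /* The deferred write happens when the line is evicted. */
    void evict(struct line *l, uint64_t line_addr) {
        if (l->dirty)
            memcpy(&slow_memory[line_addr], l->data, 64);
        l->valid = l->dirty = false;
    }

This is why deferring pays off: the memory write is paid once per evicted line rather than once per store.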
Multi-level caches
If one level of cache is good, why not use two? In fact, the Sandy Bridge uses three cache levels, the first two dedicated to operations on a single core, and the last shared among all cores on the chip!
We can extend this idea even further. We can use the hard disk as yet another level of cache, yielding something called virtual memory. Using this kind of cache needs support from the operating system, but all modern operating systems do just that!
Memory hierarchy
We can summarize all these techniques using something called the memory hierarchy, which shows how fast each stage is in relative terms:
- Registers - 1 KB, 1 cycle access time
- L1 cache - 10 KB, 10 cycle access time
- L2 cache - 1 MB, 100 cycle access time
- DRAM - 1 GB, 1,000 cycle access time
- Virtual memory - 1 TB, 1M cycle access time
Obviously, getting data from memory can be sped up if we use caches to store results we are likely to use soon.
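To see what the hierarchy buys us, we can estimate the average access time. A small worked example in C, assuming (purely for illustration) a 90% hit rate at each level and the latencies listed above:

    #include <stdio.h>

    int main(void) {
        /* illustrative latencies from the list above (cycles) */
        double l1 = 10, l2 = 100, dram = 1000;
        double hit = 0.90;   /* assumed hit rate at each level */

        /* simplified model: we charge only the latency of the level
           that finally answers, weighted by how often that happens */
        double avg = hit * l1
                   + (1 - hit) * hit * l2
                   + (1 - hit) * (1 - hit) * dram;
        printf("average = %.1f cycles vs. %.0f for DRAM alone\n",
               avg, dram);   /* 28.0 vs 1000 */
        return 0;
    }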
Instruction decoding
Once the processor has fetched an instruction (or more) into its internal structures, the process of decoding the instruction can begin. The Pentium is called a Complex Instruction Set Computer (or CISC) because, well, its instructions can be pretty complex! Other machines are much simpler. In fact, another class of processor is called a Reduced Instruction Set Computer (or RISC) because the instructions are very simple. Intel chose an interesting scheme to deal with its CISC instructions.
Micro-operations
All instructions in the chip are translated into one or more micro-operations, or uops. These uops are, in fact, RISC-level instructions. The advantage to this scheme is that a uop is much simpler to execute. By itself, this is a neat trick, as long as the set of uops supported by the chip is sufficiently complete.
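As a rough picture of what a decoded uop might carry, here is a minimal sketch in C; the fields and register numbers are invented for illustration (the real internal format is not public):

    enum uop_kind { UOP_LOAD, UOP_ADD, UOP_STORE };

    struct uop {
        enum uop_kind kind;
        int dst, src;     /* register numbers; -1 = unused, temps high */
        long addr;        /* memory address, when one is involved      */
    };

    /* "ADD [mem], RAX" broken into three simple steps, matching the
       decomposition shown later in this article */
    struct uop add_mem_rax[] = {
        { UOP_LOAD,  100,  -1, 0x1000 },  /* tmp100 <- [mem]        */
        { UOP_ADD,   100,   0, 0      },  /* tmp100 <- tmp100 + RAX */
        { UOP_STORE,  -1, 100, 0x1000 },  /* [mem]  <- tmp100       */
    };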
Here is another diagram showing the internal components used for decoding the incoming byte stream into a set of uops:
Pipelining
Once the instructions are in uop form, we have a second set of instructions to process. The question now is how to speed up this processing.
One of the most important developments in the evolution of high speed processing was the concept of pipelining. The concept is pretty simple on the surface, but has gotten very sophisticated in modern processors!
Here is the basic idea:
Break the process up into independent stages! Here is a simple view of the scheme:
Each of these blocks performs a single part of the complete process. Once they have completed their work, they can start another operation. If we extend this idea, we can compress the total time needed for processing. The new scheme looks like this:
Here, the instruction fetch unit can begin working on the second instruction as soon as it has completed work on the first instruction. The overlapping instructions show how all parts of the process are kept busy (except for those few steps where the stream of instructions starts or completes). Most of the time, things go quicker!
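The payoff is easy to quantify in the ideal case (ignoring stalls and hazards). A minimal sketch in C, assuming four stages and ten instructions:

    #include <stdio.h>

    int main(void) {
        int stages = 4;        /* fetch, decode, execute, store */
        int insns  = 10;

        /* without pipelining, each instruction waits for the last */
        int serial = insns * stages;

        /* with pipelining, a new instruction enters every cycle once
           the pipe is full: fill time plus one cycle per instruction */
        int pipelined = stages + (insns - 1);

        printf("serial: %d cycles, pipelined: %d cycles\n",
               serial, pipelined);   /* 40 vs 13 */
        return 0;
    }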
The chip actually has many individual units of functionality internally which can be scheduled independently by the processor. So parallel operations in the chip are always happening. Just what we want in a pipelined processor!
Instruction reordering
The uop architecture has other advantages. One is that we can reorder the instructions and execute them in an order that takes maximum advantage of the available hardware, as long as everything is lined up later so we get the results we expect. Here is how it works:
Here are some sample instruction decodings into uops:
- ADD RAX, RBX - one uop
- ADD RAX, [mem] - 2 uops
  - 1: read from memory into an unnamed register
  - 2: add that temp to RAX
- ADD [mem], RAX - 3 uops
  - 1: read from memory into a temp register
  - 2: add RAX to temp
  - 3: write from temp to memory
A simple example of how that can be used to speed up operations is this sequence:
    mov rax, [mem1]
    imul rax, 5
    add rax, [mem2]
    mov [mem3], rax
Using uops, we split the add instruction into two parts:
    mov rax, [mem1]
    imul rax, 5
    mov tmp1, [mem2]
    add rax, tmp1
    mov [mem3], rax
If none of the required data is in a cache, we can reorder these instructions this way:
    mov rax, [mem1]
    mov tmp1, [mem2]
    imul rax, 5
    add rax, tmp1
    mov [mem3], rax
Here, two pieces of data can be fetched from memory simultaneously. Or the multiply could happen while the second memory fetch is being performed. All that matters is that the data is in the right spot before the following instructions are executed.
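How does the scheduler know a swap is safe? In essence, it checks that neither uop touches a register the other writes. A minimal sketch in C, using an invented uop record (real hardware must also compare memory addresses):

    #include <stdbool.h>

    struct uop { int dst; int src[2]; };   /* register numbers; -1 = none */

    /* b conflicts with a if b reads or overwrites what a writes */
    bool depends_on(struct uop a, struct uop b) {
        if (a.dst < 0) return false;       /* a writes nothing */
        return b.src[0] == a.dst || b.src[1] == a.dst || b.dst == a.dst;
    }

    /* the two uops may run in either order only if neither depends
       on the other */
    bool can_reorder(struct uop a, struct uop b) {
        return !depends_on(a, b) && !depends_on(b, a);
    }

In the sequence above, mov tmp1, [mem2] neither reads nor writes rax, so it can slide ahead of imul rax, 5.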
Register renaming
Sometimes, instructions seem to reference specific registers, but closer examination reveals that the operations are independent. Here is an example:
    mov rax, [mem1]
    imul rax, 6
    mov [mem2], rax
    mov rax, [mem3]
    add rax, 2
    mov [mem4], rax
If you look closely, the last three instructions are independent of the first three, even though register RAX is used in all of them. If the processor could use another register, we could do things in parallel! In fact, the Pentium has a number of available internal registers it can use for this purpose.
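Here is a minimal sketch of the renaming idea in C; the table and its sizes are invented for illustration:

    #include <stdio.h>

    #define ARCH_REGS 16

    /* maps each architectural register to the physical register that
       currently holds its value (the chip has far more physical than
       architectural registers, and recycles them; we just count up) */
    static int rename_map[ARCH_REGS];
    static int next_free = ARCH_REGS;

    /* every write gets a fresh physical register */
    int rename_dst(int arch) { return rename_map[arch] = next_free++; }

    /* a read uses whichever physical register holds the current value */
    int rename_src(int arch) { return rename_map[arch]; }

    int main(void) {
        int rax = 0;
        int first  = rename_dst(rax);  /* mov rax, [mem1] starts chain 1 */
        int second = rename_dst(rax);  /* mov rax, [mem3] starts chain 2 */
        /* the chains land in different physical registers, so the
           hardware can execute them in parallel */
        printf("chain 1 -> p%d, chain 2 -> p%d\n", first, second);
        return 0;
    }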
Branch prediction
Another trick the Pentium uses involves predicting which way a branch will go. Doing this involves tracking how the instruction behaves over a series of executions, which only helps when the instruction is processed repeatedly, as part of a loop.
The basic idea is pretty simple. We set up a special counter for each branch, and increment or decrement it each time the branch is processed. If the branch has been taken the last several times, we assume it will be taken again. Each time the branch is not taken, we reduce the likelihood that we will predict it taken next time. Here is a diagram showing how the counter is used:
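The counter commonly used here is a two-bit saturating counter with four states. A minimal sketch of that scheme in C (real hardware keeps one counter per branch; one is enough to show the idea):

    #include <stdbool.h>
    #include <stdio.h>

    static int counter = 2;   /* states 0-1 predict "not taken",
                                 states 2-3 predict "taken"       */

    bool predict(void) { return counter >= 2; }

    void update(bool taken) {
        if (taken  && counter < 3) counter++;   /* more confident */
        if (!taken && counter > 0) counter--;   /* less confident */
    }

    int main(void) {
        /* a loop branch: taken on every iteration, not taken at exit */
        bool history[] = { true, true, true, true, false, true };
        int correct = 0;
        for (int i = 0; i < 6; i++) {
            if (predict() == history[i]) correct++;
            update(history[i]);
        }
        printf("%d of 6 predictions correct\n", correct);  /* 5 of 6 */
        return 0;
    }

A loop branch like this is predicted correctly on every iteration except the final exit, and the counter recovers after a single update.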
The final picture of instruction decoding looks like this. The cache system is used to speed up initial access to a set of instructions to be processed. The branch prediction unit helps decide what instructions are likely to be needed.
Once the instructions are in the chip, they can be routed to decoders. There are several decoder units available, and they can work in parallel to speed up the decoding process. The final result of all this is a queue of uops ready for processing. The next stage in the Pentium draws from this queue.
Instructions ready for processing sit in the uop queue until pulled in for actual processing. As this is done, registers can be renamed to make use of available hidden registers, and the sequencing of the uops can be modified to speed processing up. Notice that there is separate processing for integer and floating point operations. The last stage in this step is a set of buffers that hold memory and register access information.
The final scheduler blocks provide uops to the next stage where execution actually happens.
Execution
The next stage in the four-step is actual execution. Here, the processor has a number of execution units available depending on the specific uops to be processed. The instruction is handed to the correct unit and the output collected for further handling. By the time we reach this point, the actual execution process is pretty simple!
As you can see, the actual execution units are different and provide support for many instructions we have not explored in this class.
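As a rough illustration of routing uops to units, here is a minimal sketch in C; the unit names and the mapping are invented (the real chip issues uops to several ports, each feeding its own mix of units):

    #include <stdio.h>

    enum uop_kind { UOP_ALU, UOP_MUL, UOP_LOAD, UOP_STORE };

    /* pick the execution unit that can handle this kind of uop */
    const char *dispatch(enum uop_kind k) {
        switch (k) {
        case UOP_ALU:   return "integer ALU";
        case UOP_MUL:   return "multiplier";
        case UOP_LOAD:  return "load unit";
        case UOP_STORE: return "store unit";
        }
        return "unknown";
    }

    int main(void) {
        enum uop_kind prog[] = { UOP_LOAD, UOP_MUL, UOP_ALU, UOP_STORE };
        for (int i = 0; i < 4; i++)
            printf("uop %d -> %s\n", i, dispatch(prog[i]));
        return 0;
    }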
Storing
The last step involves getting results back where they need to go. If the location is internal, the processor can move the data into place easily. If it needs to go back to memory, we need to engage the cache system again. Here is the final set of management blocks in the chip:
The scheduler manages routing of data items to the right spot for final handling.


